Application of Weighted Voting Taggers to Languages Described with Large Tagsets

Authors

  • Marcin Kuta
  • Jacek Kitowski
  • Wojciech Wójcik
  • Michal Wrzeszcz
Abstract

The paper presents baseline and complex part-of-speech taggers applied to the modified corpus of the Frequency Dictionary of Contemporary Polish, which is annotated with a large tagset. First, the paper examines the accuracy of six baseline part-of-speech taggers. The main part of the work presents simple weighted voting and complex voting taggers. Special attention is paid to lexical voting methods and to the handling of ties and fallbacks. The TagPair and WPDV voting methods achieve the highest accuracy among all methods considered. The error reduction of 10.8% with respect to the best baseline tagger for the large tagset is comparable with other authors' results for small tagsets.
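
As a rough illustration of the simple weighted voting idea mentioned in the abstract, the following minimal sketch combines the tags proposed by several baseline taggers for one token. It is not the authors' implementation: the function name `weighted_vote`, the tagger names, the tag labels, the weights, and the tie-breaking rule (prefer the tied tag proposed by the highest-weighted tagger) are all assumptions made for the example.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Pick one tag for a token from several taggers' proposals.

    predictions: dict mapping tagger name -> proposed tag (hypothetical names)
    weights:     dict mapping tagger name -> vote weight (assumed here to be
                 something like each tagger's accuracy on held-out data)
    """
    if not predictions:
        raise ValueError("at least one tagger prediction is required")

    # Sum the weights of all taggers that proposed the same tag.
    scores = defaultdict(float)
    for tagger, tag in predictions.items():
        scores[tag] += weights[tagger]

    best = max(scores.values())
    tied = {tag for tag, score in scores.items() if score == best}

    # Fallback on ties (an assumed rule): walk the taggers from highest to
    # lowest weight and return the first proposal that is among the tied tags.
    for tagger in sorted(predictions, key=lambda t: weights[t], reverse=True):
        if predictions[tagger] in tied:
            return predictions[tagger]


# Clear majority by weight: "subst" collects 0.8 + 0.7, "adj" only 0.9.
clear = weighted_vote(
    {"tagger_a": "subst", "tagger_b": "adj", "tagger_c": "subst"},
    {"tagger_a": 0.8, "tagger_b": 0.9, "tagger_c": 0.7},
)

# Tie at 1.0 each: the fallback prefers the tag of the heaviest tagger.
tie = weighted_vote(
    {"tagger_a": "subst", "tagger_b": "adj", "tagger_c": "adj"},
    {"tagger_a": 1.0, "tagger_b": 0.5, "tagger_c": 0.5},
)
```

The tie-breaking rule is only one of several possible fallbacks; the paper itself discusses ties and fallbacks in more detail than this sketch covers.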

Similar resources

Towards A Welsh Semantic Annotation System

Automatic semantic annotation of natural language data is an important task in Natural Language Processing, and a variety of semantic taggers have been developed for this task, particularly for English. However, for many languages, particularly for low-resource languages, such tools are yet to be developed. In this paper, we report on the development of an automatic Welsh semantic annotation to...

Do LSTMs really work so well for PoS tagging? - A replication study

A recent study by Plank et al. (2016) found that LSTM-based PoS taggers considerably improve over the current state-of-the-art when evaluated on the corpora of the Universal Dependencies project that use a coarse-grained tagset. We replicate this study using a fresh collection of 27 corpora of 21 languages that are annotated with fine-grained tagsets of varying size. Our replication confirms the...

Data-Driven Part-of-Speech Tagging of Kiswahili

In this paper we present experiments with data-driven part-of-speech taggers trained and evaluated on the annotated Helsinki Corpus of Swahili. Using four of the current state-of-the-art data-driven taggers, TnT, MBT, SVMTool and MXPOST, we observe the latter as being the most accurate tagger for the Kiswahili dataset. We further improve on the performance of the individual taggers by combining ...

On the Art of Taming and Exploiting Parallel Tags in a Multilingual Corpus

Multilingual parallel corpora can be annotated with monolingual tools, such as morphosyntactic taggers. However, even taggers for typologically similar languages use incompatible tagsets, which results in a conceptual and formal variety of tags. Retraining taggers on data annotated with a common tagset is not a realistic option. However, differences between tagsets are often rooted in different...

Dialogue Acts: One or More Dimensions?

This report surveys the main theories of dialogue and communication that have been used to devise dialogue act tagsets, distinguishing theories that deal with a specific level of communication from theories that integrate several levels. The report proceeds to analyse four dialogue act tagsets that have been used to annotate large scale dialogue corpora (DAMSL, SWBD-DAMSL, ICSI-MRDA and MALTUS)...

Journal title:
  • Computing and Informatics

Volume 29

Publication date: 2010